init commit optimistic relay #1

Open · wants to merge 22 commits into main

Conversation

@michaelneuder (Owner) commented Dec 29, 2022

there are a few potential paths we could take to set up an optimistic relay. below are two that i considered:

Option 1. Leave the builder status as is (HighPrio, LowPrio, Blacklisted). Modify the relay logic to only run block simulation for LowPrio builders, thus redefining HighPrio as optimistic. When we call the prio-load-balancer (https://github.com/flashbots/prio-load-balancer), we then only ever use the LowPrio queue.
Pros

  • No modification of the redis KV store or postgres db schema.
  • Smallest amount of code change.

Cons

  • Reduces our flexibility. We can no longer differentiate between LowPrio and HighPrio builders who are in the pessimistic mode.

Option 2. Add a new builder status called Optimistic. Thus we effectively have 4 builder statuses: Optimistic > HighPrio > LowPrio > Blacklisted.
Pros

  • More flexibility, as it allows us to maintain the distinction between different priorities for the pessimistic builders.
  • Doesn't conflate HighPrio, which has a different semantic meaning in https://github.com/flashbots/prio-load-balancer, with Optimistic, which is a new concept we introduce.

Cons

  • More code.
  • Requires schema and redis updates.
  • Cognitive load to explain the new hierarchy of builder statuses.

Based on the above, I propose Option 2 for its flexibility.

initial plan for changes required to implement Option 2:

  1. add a new builder state called Optimistic (a minimal sketch of the new status hierarchy follows this list).
  2. modify the handleSubmitNewBlock Builder API handler to skip block simulation for blocks from builders in the Optimistic state. these blocks will be pushed into an OptimisticBlock buffered channel where their validity will be checked asynchronously. if any block turns out to be invalid, we demote its builder to LowPrio status (or could even blacklist them if we want to be conservative).
  3. modify the handleGetHeader Proposer API handler to check the best-bid block (could do this in-band or through a similar buffered-channel mechanism as we do for all incoming blocks). if it turns out that the winning bid was an invalid block for one of our proposers (and they ended up publishing it), we need to reimburse them using the collateral posted by the builder of the invalid block (still need to flesh this out a bit).
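
To make point 1 concrete, here is a minimal sketch of what the extended status hierarchy could look like; the type and constant names are illustrative, not actual mev-boost-relay identifiers:

```go
// Illustrative only: the builder status ordering under Option 2,
// Optimistic > HighPrio > LowPrio > Blacklisted.
package relay

type BlockBuilderStatus string

const (
	StatusOptimistic  BlockBuilderStatus = "optimistic"
	StatusHighPrio    BlockBuilderStatus = "high-prio"
	StatusLowPrio     BlockBuilderStatus = "low-prio"
	StatusBlacklisted BlockBuilderStatus = "blacklisted"
)
```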

open questions:

  1. how do we ensure that we only accept optimistic blocks if the previous slot block has been validated?

@JustinDrake

I propose Option 2 for its flexibility

Agreed Option 2 is best :) A nice-to-have goal is for the optimistic relay code to be upstreamed to https://github.com/flashbots/mev-boost-relay, and for that we should avoid redefining things like HighPrio.

@blombern

Great work! I also agree on option 2.

One question about point 2 of the implementation: I'm assuming the OptimisticBlock channel would still only talk to the prio-load-balancer (when enabled), and not to the block validation nodes directly?

I'm also wondering if there's a possible edge case where there are optimistic blocks buffered in memory when the service is shut down (can happen whenever), and we end up never validating them. The worst-case block validation time can reach up to 6s. 95th percentile is way lower, but the distribution has a very long right tail so it does happen. Just thinking out loud.

Excited to see this idea come to life :)

@michaelneuder (Owner, Author) commented Dec 31, 2022

Ok I added more to this implementation. I break down the changes into 3 main sequences:

a) Block Submission Flow
b) Optimistic Block Processing
c) Block Proposal Flow

I will describe each of these in more detail and the changes required for them.


[Screenshot: Block Submission Flow diagram]

  1. Block builders submit blocks to the Builder API endpoint of the relay.
  2. The block builder API goroutine fetches the status of the block builder from redis.
  3. Based on the status of the builder, 3 paths can be taken.
    a. if the builder is in optimistic mode, send the block to the Optimistic Block Channel.
    b. if the builder is in high prio mode, send the block to the prio-load-balancer.
    c. if the builder is in low prio mode, send the block to the prio-load-balancer.
  4. If the builder is not in optimistic mode, wait until the block has been successfully simulated on the validation nodes.
  5. Update the builder's current bid in redis.

Notice that for builders in optimistic mode, we update the bid after sending the block to the Optimistic Block Channel, without validating it. This is where the improved performance can be achieved.
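
A rough, self-contained sketch of the step-3 branch described above; every name here is a stand-in for illustration, not the relay's real handler, types, or helpers:

```go
// Rough sketch of the step-3 branch in the block submission flow.
// All names here are illustrative stand-ins, not mev-boost-relay's real API.
package relay

import "errors"

type SubmitBlockRequest struct {
	BuilderPubkey string
	Value         uint64
}

var (
	errBlacklisted = errors.New("builder is blacklisted")

	// optimisticCh stands in for the buffered Optimistic Block Channel.
	optimisticCh = make(chan *SubmitBlockRequest, 128)
)

// simulateBlock stands in for the synchronous prio-load-balancer round trip.
func simulateBlock(req *SubmitBlockRequest, highPrio bool) error { return nil }

// updateBestBid stands in for the redis bid update (step 5).
func updateBestBid(req *SubmitBlockRequest) error { return nil }

func handleSubmission(req *SubmitBlockRequest, status string) error {
	switch status {
	case "optimistic":
		// 3a: skip the synchronous simulation; validity is checked asynchronously.
		optimisticCh <- req
	case "high-prio", "low-prio":
		// 3b/3c and 4: block until the simulation result comes back.
		if err := simulateBlock(req, status == "high-prio"); err != nil {
			return err
		}
	default:
		return errBlacklisted
	}
	// 5: only now update the builder's current bid.
	return updateBestBid(req)
}
```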


[Screenshot: Optimistic Block Processing diagram]

  1. The OptimisticBlockProcessor receives a block from the Optimistic Block Channel (this happens in a different goroutine than the Builder API).
  2. The OptimisticBlockProcessor sends the block as high prio to the prio-load-balancer.
  3. The block is simulated on the validation nodes, and the status is returned to the OptimisticBlockProcessor.
  4. If the simulation fails, the status of the builder is updated so that they are no longer Optimistic (currently I am setting it to low prio).

This goroutine handles the simulation of all the blocks that we optimistically skipped in the Block Submission Flow. If we determine that a builder submitted an invalid block, we change their status and stop optimistically handling their blocks.
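
A minimal sketch of that processor loop, with placeholder names rather than the relay's actual types and functions:

```go
// Illustrative sketch of the Optimistic Block Processing goroutine.
package relay

import "log"

type OptimisticBlock struct {
	BuilderPubkey string
}

// simulateHighPrio stands in for the prio-load-balancer round trip (steps 2-3).
func simulateHighPrio(blk *OptimisticBlock) error { return nil }

// demoteBuilder stands in for the redis/db status update (step 4).
func demoteBuilder(pubkey string) {
	log.Printf("demoting builder %s to low-prio", pubkey)
}

// startOptimisticBlockProcessor drains the Optimistic Block Channel and
// demotes any builder whose block fails simulation.
func startOptimisticBlockProcessor(ch <-chan *OptimisticBlock) {
	for blk := range ch {
		if err := simulateHighPrio(blk); err != nil {
			demoteBuilder(blk.BuilderPubkey)
		}
	}
}
```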


[Screenshot: Block Proposal Flow diagram]

  1. mev-boost calls getHeader on the Proposer API of the relay. This is part of the MEV-Boost block proposal flow as documented in https://docs.flashbots.net/flashbots-mev-boost/architecture-overview/block-proposal.
  2. mev-boost calls getPayload on the Proposer API of the relay. This triggers the publication of a SignedBeaconBlock.
  3. In a separate goroutine, the Proposer API sends the block that was just proposed to the prio-load-balancer.
  4. The block is simulated using the validation nodes and the status is returned to the Proposer API.
  5. If the block simulation failed, that means we need to refund the proposer. We insert a new row into the ProposerRefundTable with the details of the bid and the SignedBlindedBlock, which contains enough information to confirm that the refund is owed.

This flow represents the process of checking whether our optimistic block processing ever results in a proposer publishing an invalid block. Since block builders post collateral, that collateral will be used to reimburse the proposer. Since refunds should be a relatively rare event, we plan on handling them manually when they occur.
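
A sketch of the asynchronous check in steps 3-5; all names here are placeholders, and the ProposerRefundTable write is shown only as a stub:

```go
// Illustrative sketch of the post-publication check in the Proposer API.
package relay

type ProposedBlock struct {
	Slot           uint64
	ProposerPubkey string
	BuilderPubkey  string
	ValueWei       string
}

// simulateProposedBlock stands in for sending the just-published block to the
// prio-load-balancer and waiting for the result (steps 3-4).
func simulateProposedBlock(blk *ProposedBlock) error { return nil }

// recordProposerRefund stands in for inserting a row into the
// ProposerRefundTable so the refund can be settled manually (step 5).
func recordProposerRefund(blk *ProposedBlock) error { return nil }

// checkProposedBlock runs in its own goroutine after handleGetPayload returns.
func checkProposedBlock(blk *ProposedBlock) {
	if err := simulateProposedBlock(blk); err != nil {
		// The proposer published an invalid block we served optimistically,
		// so a refund from the builder's collateral is owed.
		_ = recordProposerRefund(blk)
	}
}
```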

@michaelneuder (Owner, Author)

based on the advice of @JustinDrake, I added a check on the amount of collateral posted by the builder, which is set in the database and redis. If their collateral is less than the value of a block that they submit, then we fall back to the high-prio queue for that block. I chose not to make this state change permanent, because they might have just found one really profitable block, and it could turn out to be valid, so we shouldn't penalize them just because their collateral can't cover the block value.
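
A minimal sketch of that check, assuming placeholder names and a simple wei comparison:

```go
// Illustrative sketch of the per-block collateral check.
package relay

import "math/big"

// useOptimisticPath returns false when the builder's posted collateral cannot
// cover the block's value; in that case the block falls back to the
// pessimistic high-prio queue for this submission only.
func useOptimisticPath(status string, collateralWei, blockValueWei *big.Int) bool {
	if status != "optimistic" {
		return false
	}
	return collateralWei.Cmp(blockValueWei) >= 0
}
```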

I am going to create a state machine diagram outlining the status and collateral fields of the builder in the redis cache and the db.

One open question is how we get the redis cache updated with the collateral values we set in the DB. Still need to iron this out a bit.

@michaelneuder (Owner, Author)

also an additional note about the DB. It seems like the migration code is in place so that people who are already running the relay can pull in upstream changes while keeping their DB in place and only updating the fields. I modified the 001-initdb file directly, which would make these changes incompatible with relays that are already running. This might be something we want to refactor to make backwards compatible, but the migration logic would be non-trivial (e.g., we need to update the builder status based on the values of the existing rows rather than just dropping the column and adding a new one).

@@ -286,6 +298,9 @@ func (api *RelayAPI) StartServer() (err error) {
 	if api.opts.BlockBuilderAPI {
 		// Get current proposer duties blocking before starting, to have them ready
 		api.updateProposerDuties(bestSyncStatus.HeadSlot)
 
+		// TODO(mikeneuder): consider if we should use >1 optimisticBlockProcessors.
+		go api.startOptimisticBlockProcessor()

optimisticBlockProcessors are extremely lightweight, right? Why have more than 1?

michaelneuder (Owner, Author):

i guess it depends how many optimistic blocks we get. startOptimisticBlockProcessor waits for the response of the block simulation from the prio-load-balancer, so if we only have one goroutine running this we are processing all the optimistic blocks serially. i thought it might be desirable to have a few goroutines sending the simulation requests because we have multiple validation nodes running. thoughts?

it depends how many optimistic blocks we get

We should expect to get hammered with optimistic blocks, partly because all the builders will want to connect to the ultra sound relay, but also partly because we can relax QoS limits (like max 2 blocks/sec).

if we only have one goroutine running this we are processing all the optimistic blocks serially

My naive understanding was that startOptimisticBlockProcessor calls simulateBlock, which calls api.blockSimRateLimiter.send, which itself uses parallelism.

michaelneuder (Owner, Author):

that is the call path where we call api.BlockSimRateLimiter.send. that function can handle a bunch of concurrent blocks, but it still waits on the response from the simulation for the single block that we sent. so if we only have a single go startOptimisticBlockProcessor() I think we are only processing one block from that channel at a time.

e.g.,

  1. we receive a block from the channel
  2. we call api.BlockSimRateLimiter.send
  3. we wait until send returns to get the simulation error

whereas we could/should have multiple goroutines listening on the channel, and concurrent calls to api.BlockSimRateLimiter.send. maybe i am missing something though.
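
For example, a rough sketch of running several processors on the same channel (reusing the placeholder names from the Optimistic Block Processing sketch above, not the relay's real API):

```go
// Illustrative sketch: n workers draining the same channel, so up to n
// optimistic block simulations can be in flight at once.
package relay

func startOptimisticBlockProcessors(n int, ch <-chan *OptimisticBlock) {
	for i := 0; i < n; i++ {
		go func() {
			for blk := range ch {
				// each worker blocks on its own simulation round trip
				if err := simulateHighPrio(blk); err != nil {
					demoteBuilder(blk.BuilderPubkey)
				}
			}
		}()
	}
}
```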

concurrent calls to api.BlockSimRateLimiter.send

Ok, agreed we should have concurrent calls to api.BlockSimRateLimiter.send :) I don't have a particularly informed opinion as to the most natural way to do it here.

@michaelneuder (Owner, Author)

OK another rather large structural change in: 3a37075

  1. Add a field called collateral_id to the BlockBuilder table. We will use this to identify builders that use multiple builder pubkeys, but want to have the same collateral backstopping them.
  2. Add GetBlockBuildersFromCollateralID to the DB service which fetches all the builder pubkeys that have the same collateral id.
  3. Add demoteBuildersCollateralID to the API service which uses a single builder pubkey to fetch all pubkeys that share a collateral id and demotes them in both redis and the db (rough sketch below this list).
  4. Call demoteBuildersCollateralID from startOptimisticBlockProcessor and handleGetPayload if the simulation fails in either case.
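
A rough sketch of the demotion helper from point 3; the store interface, the GetCollateralID helper, and the method signatures are placeholders rather than the actual DB/redis services:

```go
// Illustrative sketch of demoting every builder pubkey backed by the same collateral.
package relay

import "fmt"

// builderStore is a placeholder for the slice of the DB/redis services used here.
type builderStore interface {
	GetCollateralID(builderPubkey string) (string, error)
	GetBlockBuildersFromCollateralID(collateralID string) ([]string, error)
	SetBuilderStatus(pubkey, status string) error // writes both redis and the db
}

// demoteBuildersCollateralID takes the offending builder's pubkey, finds every
// pubkey sharing its collateral_id, and demotes them all.
func demoteBuildersCollateralID(store builderStore, builderPubkey string) error {
	collateralID, err := store.GetCollateralID(builderPubkey)
	if err != nil {
		return fmt.Errorf("lookup collateral_id for %s: %w", builderPubkey, err)
	}
	pubkeys, err := store.GetBlockBuildersFromCollateralID(collateralID)
	if err != nil {
		return err
	}
	for _, pk := range pubkeys {
		if err := store.SetBuilderStatus(pk, "low-prio"); err != nil {
			return err
		}
	}
	return nil
}
```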
